[RSDK-13303] Fix flaky TestRawClientOperation by draining stream before reading headers#6087
[RSDK-13303] Fix flaky TestRawClientOperation by draining stream before reading headers#6087viam-ci-org-reader[bot] wants to merge 1 commit into
Conversation
…re reading headers Drain the EchoMultiple stream before calling Header(). Over WebRTC, the server-side stream can complete and cancel the context before Header() retrieves the response metadata, causing a "context canceled" error. Receiving all messages first ensures headers have arrived and are cached. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
aldenh-viam
left a comment
There was a problem hiding this comment.
I thought this looked a little strange and asked Claude to dig deeper: this fix would likely work, bit might actually be covering up a race the WebRTC code (and behavioral difference with HTTP/2):
The WebRTC Header() is a different shape from HTTP/2 and has a real race here:
// wrtc_client_stream.go:127 func (s *webrtcClientStream) Header() (metadata.MD, error) { select { case <-s.headersReceived: return s.headers, nil default: } select { case <-s.ctx.Done(): return nil, s.ctx.Err() case <-s.headersReceived: return s.headers, nil } }Two important differences from the HTTP/2 path:
- It returns the error directly — no swallow-and-make-you-call-Recv. s.ctx.Err() lands in the test's err and would fail test.That(t, err, test.ShouldBeNil).
- The stream's ctx is cancelled by trailers, not just by external cancellation. In wrtc_base_stream.go:160, closeWithError (called from processTrailers → closeFromTrailers at wrtc_client_stream.go:378) does s.cancel(). So
the moment the server's trailers arrive at the client, s.ctx.Done() fires — even on a successful RPC.The key difference
Both transports cancel the per-stream context when the stream completes (HTTP/2 does it from trailer/end-of-stream handling, WebRTC does it from closeFromTrailers → closeWithError → s.cancel() at wrtc_base_stream.go:170).
So both races are present in principle — both can be holding "headers were received cleanly" AND "stream's ctx is cancelled" at the same time.What protects HTTP/2 is a guard after the wait, not the wait itself:
// internal/transport/client_stream.go:125 func (s *ClientStream) Header() (metadata.MD, error) { s.waitOnHeader() // races ctx.Done vs headerChan if !s.headerValid || s.noHeaders { // <-- re-check: were headers actually received? return nil, s.status.Err() } return s.header.Copy(), nil }So even if the select inside waitOnHeader woke on ctx.Done(), the post-wait headerValid check returns the headers anyway when they really did arrive. WebRTC's Header() skips that step and trusts whichever case won the
select.(Separately, the outer (*clientStream).Header() in HTTP/2 swallows errors and returns (nil, nil) per gRPC-Go convention, but that's an orthogonal design choice — not what makes the race benign.)
|
@aldenh-viam in this case, it looks like this is still a reasonable fix for the flakyness. or you think we should take a different approach? |
As a workaround quick fix, yes, but I think it would be better to confirm whether the |
|
i think i'm gonna try to get goutils work with this flow too. as there are a number of flaky test that seem to be related to go utils instead |
The check-approvals job posts a Review Required comment on the PR when approvals are unmet. That comment goes through the issues-comments endpoint, but for a pull request resource the GITHUB_TOKEN needs pull-requests: write -- issues: write does not cover PRs. The job only granted pull-requests: read + issues: write, so the POST returned 403 'Resource not accessible by integration' and crashed the check on every Claude PR at open time (e.g. #6087). Verified on the same PR: assign-netcode-reviewer (pull-requests: write) edited the PR fine, while check-approvals (issues: write) 403'd -- confirming the scope, not the token/actor, is the issue. Brings this job in line with the other claude/* workflows (claude-pr-assistant, claude-ci-fix, claude-jira), which all already grant pull-requests: write. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Fixes a flaky test failure in
TestRawClientOperationwhereechoStreamClient.Header()returnscontext canceled.EchoMultiplestream completes nearly instantly (sends one response and returns), which can cancel the stream context before the client callsHeader()to retrieve response metadataRecv()beforeHeader(), ensuring headers are received and cached before the stream context is canceledwrtc_server_stream.go:269: message received after EOSin the failure logs confirms the race between stream completion and header retrievalTest plan
go test -v -run TestRawClientOperation -count=5 ./robot/web/passes consistentlygo vet ./robot/web/passesDRI
@ale7714 is the responsible engineer for this PR.
🤖 Generated with Claude Code